VarClass: An Open-source Language Identification Tool for Language Varieties

Authors

  • Marcos Zampieri
  • Binyam Gebrekidan Gebre
Abstract

This paper presents VarClass, an open-source tool for language identification that is available both as a download and through a user-friendly graphical interface. The main difference between VarClass and other state-of-the-art language identification tools is its focus on language varieties. General-purpose language identification tools do not take language varieties into account, and our work aims to fill this gap. VarClass currently contains language models for over 27 languages, 10 of which are language varieties. We report an average accuracy of over 90.5% on a challenging dataset. More language models will be included in the upcoming

Similar resources

Persian-Speaking Teachers’ Perspectives on Methods and Materials for Teaching English as an International Language

Despite the global spread of English, it seems that voices from Persian-speaking teachers concerning English as an international language (EIL) teaching methods and materials are underrepresented. The present study set out to explore how nonnative Persian-speaking English language teachers respond to the increasing global dominance of EIL and native- and non-native-speakers’ language norms with...

Leveraging the open source ispell codebase for minority language analysis

The ispell family of spellcheckers is perhaps the single most widely ported and deployed open-source language tool. Here we describe how the SzóSzablya ‘WordSword’ project leverages ispell’s Hungarian descendant, HunSpell, to create a whole set of related tools that tackle a wide range of low-level NLP-related tasks such as character set normalization, language detection, spellchecking, stemmin...

A Rule Based Pronunciation Generator and Regional Accent Databank for Portuguese

One of the major obstacles in deploying spoken language technologies (SLTs) in the developing world is a lack of key linguistic resources – e.g. electronic dictionaries, phonetically aligned corpora, pronunciation lexicons, etc. – that describe the non-dominant varieties spoken in such countries and regions. In this paper, we describe the work of the LUPo (Portuguese Unisyn Lexicon) project to ...

CorpusCollie - A Web Corpus Mining Tool for Resource-Scarce Languages

This paper describes CORPUSCOLLIE, an open-source software package that is geared towards the collection of clean web corpora of resource-scarce languages. CORPUSCOLLIE uses a wide range of information sources to find, classify and clean documents for a given target language. One of the most powerful components in CORPUSCOLLIE is a maximum-entropy based language identification module that is ab...

The Potentiality of Dynamic Assessment in Massive Open Online Courses (MOOCs): The Case of Listening Comprehension MOOCs

Massive Open Online Courses (MOOCs) as a new shaking educational development provide the scene for achieving social inclusion and dissemination of knowledge. Anyhow, facilitating network learning experiences through creating an adaptive learning environment can pave the way for this open and energetic way to learning. The present study aimed to explore the possible role of Dynamic Assessment (D...

Publication date: 2014